Background

According to the World Health Organization, every year, about 800,000 people die due to suicide. In this project, we are wondering the trend of suicide rate across years.

Data

The main dataset for our project is a combined dataset from summary datasets made by United Nations Development Program, World Bank, Kaggle, and World Health Organization. It can be access at here. The dataset has a size of 27660 observations and 11 features. Features include:

Exploratory data analysis (EDA)

General Findings

Trend of Global Suicide Rate over time

Before 1995, the suicide rate at the global level is increasing, but since then, it keeps decreasing.

trend with sex

We found that surprisingly, male has higher rate of suicide than female since 1985. Female suicide rate has a very stable trend throughout the history, while there were dramatic changes for male.

p <- maindata%>%
  group_by(year, sex) %>%
  summarize(suicide_per_100k = (sum(as.numeric(suicides_no)) / sum(as.numeric(population))) * 100000) %>%
  ggplot(aes(x = year, y = suicide_per_100k, col = factor(sex))) + 
  geom_line() + 
  geom_point() + 
  labs(title = "Trends Over Time, by Sex", 
       x = "Year", 
       y = "Suicides per 100k", 
       color = "Sex") + 
  scale_x_continuous(breaks = seq(1985, 2015, 5), minor_breaks = F)


p + transition_reveal(year)

trend with age

Suicide rates for the youngest age group nearly constant and low over time. As the graph shown, elder groups have had higher suicide rate since 1985, and surprisingly such trend has not changed once.

p <- maindata %>% group_by(year, age) %>%
  summarize(suicide_per_100k = (sum(as.numeric(suicides_no)) / sum(as.numeric(population))) * 100000) %>% 
  ggplot(aes(year, suicide_per_100k, fill = age)) +
  geom_density(stat = "identity",alpha=0.2) +
  labs(y = "Suicide rate", title = "Suicide rates across age group over time")

p

Suicide Rate by Country GDP

GDP has been viewed as a good measure about the development of a country. However, graph below shows that there are no obvious trend between GDP and suicide rate. Although GDPs across the world have been shifted toward larger direction, such trend persists.

Countries with most suicides across the years

suicide_sub<-maindata %>% select(country,year,sex,suicides_no)


n<-unique(suicide_sub$country)
country<-function(x){
  suicide2<-suicide_sub %>% filter(country==x)
  sum(suicide2$suicides_no)
}

country_total<-sapply(n,function(x) country(x))

df<-do.call(rbind,Map(data.frame,Country=n,Total_Suicides=country_total))
df2<-df %>% arrange(desc(Total_Suicides))
df3<-head(df2,n=10)

top_suicide<-suicide_sub %>% filter(country==c('Russian Federation',
                                               'United States',
                                               'Japan','France',
                                               'Ukraine',
                                               'Germany',
                                               'Republic of Korea',
                                               'Brazil',
                                               'Poland',
                                               'United Kingdom'))

top_suicide2<-top_suicide
top_suicide2$sex<-as.factor(top_suicide2$sex)

sm3<-aggregate(suicides_no~country+year,top_suicide2,sum)

sm4<-sm3 %>% group_by(year) %>% mutate(rank = min_rank(-suicides_no) * 1) %>%
 ungroup()

static_plot<-ggplot(sm4,aes(rank,group=country,fill=as.factor(country),color=as.factor(country))) +
 geom_tile(aes(y = suicides_no/2,height = suicides_no, width = 0.9), alpha = 0.8, color = NA) +
 geom_text(aes(y = 0, label = paste(country, ' ')), vjust = 0.2, hjust = 1) +
 geom_text(aes(y=suicides_no,label = paste(' ',suicides_no)), hjust=0)+
 coord_flip(clip = 'off', expand = TRUE) +
 scale_y_continuous(labels = scales::comma) +
 scale_x_reverse() +
 guides(color = FALSE, fill = FALSE) +
 theme_minimal() +
 theme(
 plot.title=element_text(size=25, hjust=0.5, face='bold', colour='grey', vjust=-1),
 plot.subtitle=element_text(size=18, hjust=0.5, face='italic', color='grey'),
 plot.caption =element_text(size=8, hjust=0.5, face='italic', color='grey'),
 axis.ticks.y = element_blank(), 
 axis.text.y = element_blank(), 
 plot.margin = margin(1,1,1,4, 'cm')
 )

plt<-static_plot + transition_states(states = year, transition_length = 4, state_length = 1) + 
 ease_aes('cubic-in-out') +
 #view_follow(fixed_x = TRUE) +
 labs(title = 'Total Suicides per Year : {closest_state}', 
 subtitle = 'Top 10 Countries',
 caption = 'Data Source: World Bank Data',
 x='Countries',y='Total Suicides per year')

final_animation<-animate(plt,100,fps = 20,duration = 30, width = 950, height = 750, renderer = gifski_renderer())

final_animation

Inference

correlation for numeric variables

# From the correlation plot on numeric variables, we can see that there are correlation between gdp for year, gdp per capita and population because the calculation of gdp per capita. Thus, in the model we included both gdp indicators but not the population because we think these two represents different things for comparison between countries.

# cannot run 
# errir nessgae:  Error in cor(corr_data) : 'x' must be numeric
#corr_data = maindata %>% 
#  select(-c(1,3,4,8,11,12,13)) 
#corr= cor(corr_data)
#corrplot(corr, type = "upper", order = "hclust", 
#         tl.col = "black", tl.srt = 45)
# summary(maindata)
hist(maindata$suicides_100k_pop)

hist(maindata$gdp_for_year)

suicide = maindata %>% 
  mutate(suicides_100k_pop = suicides_100k_pop+1)

From the distribution plot of suicide_100k_pop, we can see that we need to transform it to satisfy the assumptions for linear model. we used log transformation and changed 0’s to 0.01 for further calculations. Following graphs show the effect of transformation.

hist(log(maindata$suicides_100k_pop))

bigger model

maindata_try = maindata %>% 
  mutate(log_suicide =log(suicides_100k_pop))
maindata_try$log_suicide[is.infinite(maindata_try$log_suicide)]=0.01

model1=lm(log_suicide~year+sex*age+gdp_per_capita, data = maindata_try)

plot(model1)

summary(model1)
## 
## Call:
## lm(formula = log_suicide ~ year + sex * age + gdp_per_capita, 
##     data = maindata_try)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6819 -0.6122  0.1880  0.6921  3.2762 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.041e+01  1.614e+00  12.650  < 2e-16 ***
## year             -9.692e-03  8.073e-04 -12.006  < 2e-16 ***
## sexmale           1.097e+00  3.133e-02  35.026  < 2e-16 ***
## age25-34          8.269e-02  3.133e-02   2.640 0.008301 ** 
## age35-54          2.700e-01  3.133e-02   8.618  < 2e-16 ***
## age5-14          -1.625e+00  3.133e-02 -51.870  < 2e-16 ***
## age55-74          3.531e-01  3.133e-02  11.273  < 2e-16 ***
## age75+            4.404e-01  3.133e-02  14.058  < 2e-16 ***
## gdp_per_capita    6.678e-06  3.603e-07  18.534  < 2e-16 ***
## sexmale:age25-34  2.768e-01  4.430e-02   6.247 4.24e-10 ***
## sexmale:age35-54  2.562e-01  4.430e-02   5.783 7.43e-09 ***
## sexmale:age5-14  -8.448e-01  4.430e-02 -19.069  < 2e-16 ***
## sexmale:age55-74  1.647e-01  4.430e-02   3.718 0.000201 ***
## sexmale:age75+    2.616e-01  4.430e-02   5.905 3.58e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.063 on 27646 degrees of freedom
## Multiple R-squared:  0.5109, Adjusted R-squared:  0.5107 
## F-statistic:  2221 on 13 and 27646 DF,  p-value: < 2.2e-16